Fast Retrieval of Similar Subsequences in Long Sequence Databases
Abstract
Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer high-cost distance functions such as the time warping distance and the editing distance for their wider applicability. However, if these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic in the average length L of the data sequences. In this paper, we propose a novel subsequence matching scheme, called aligned subsequence matching, in which the number of subsequences to be compared with a query sequence is reduced to linear in L. We also present an indexing technique that speeds up aligned subsequence matching, using the modified time warping distance as the similarity measure. Experiments on synthetic data sequences demonstrate the effectiveness of the proposed approach: it consistently outperformed sequential scanning and achieved a speed-up of up to 6.5 times.
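To make the quadratic count concrete, the sketch below enumerates every contiguous subsequence of a data sequence and keeps those within a tolerance of the query under the classic dynamic-programming time warping distance. It illustrates only the naive baseline the abstract describes. It is not the paper's aligned subsequence matching or its modified time warping distance, neither of which is defined on this page. The function names, the tolerance parameter `eps`, and the absolute-difference element cost are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def time_warping_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DP time warping distance (assumed per-element cost: |x - y|)."""
    if not a or not b:
        return 0.0 if not a and not b else float("inf")
    inf = float("inf")
    # prev[j] holds the best cost of warping the processed prefix of a onto b[:j].
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Extend the cheapest of the three neighbouring warping paths.
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def all_subsequences(data: Sequence[float]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for every contiguous subsequence data[start:end]."""
    for start in range(len(data)):
        for end in range(start + 1, len(data) + 1):
            yield start, end


def naive_subsequence_search(
    data: Sequence[float], query: Sequence[float], eps: float
) -> List[Tuple[int, int]]:
    """Hypothetical baseline: compare the query against every subsequence."""
    return [
        (start, end)
        for start, end in all_subsequences(data)
        if time_warping_distance(data[start:end], query) <= eps
    ]


if __name__ == "__main__":
    data = [5.0, 1.0, 2.0, 2.0, 3.0, 9.0]
    query = [1.0, 2.0, 3.0]
    print(len(list(all_subsequences(data))))  # number of candidates inspected
    print(naive_subsequence_search(data, query, eps=0.0))
```

For a data sequence of length L, `all_subsequences` yields L(L+1)/2 candidates, which is the quadratic growth the abstract refers to; each candidate then incurs a full time warping computation.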
Similar resources
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
Similarity search of time-warped subsequences via a suffix tree
This paper proposes an indexing technique for fast retrieval of similar subsequences using the time warping distance. The time warping distance is a more suitable similarity measure than the Euclidean distance in many applications where sequences may be of different lengths and/or different sampling rates. The proposed indexing technique employs a disk-based suffix tree as an index structure an...
Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases
We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar. The model allows the amplitude of one of the two sequences to be scaled by any suitable amount and its offset adjusted appropriately. Two subsequences are considered si...
Faster sequence homology searches by clustering subsequences
MOTIVATION: Sequence homology searches are used in various fields. New sequencing technologies produce huge amounts of sequence data, which continuously increase the size of sequence databases. As a result, homology searches require large amounts of computational time, especially for metagenomic analysis. RESULTS: We developed a fast homology search method based on database subsequence clusteri...
Similarity-Based Subsequence Search in Image Sequence Databases
This paper proposes an indexing technique for fast retrieval of similar image subsequences using the multi-dimensional time warping distance. The time warping distance is a more suitable similarity measure as compared to the Lp distance in many applications where sequences may be of different lengths and/or different sampling rates. Our indexing scheme employs a disk-based suffix tree as an ind...
Publication date: 1999